Abstract: Open-vocabulary 3D affordance detection is regarded as a critical link between high-level semantic understanding and low-level robotic manipulation, enabling precise localization of object functional regions in unstructured environments. However, existing methods mostly rely on frozen pre-trained vision-language models for shallow feature matching, and their generalization is limited by two challenges: semantic ambiguity in text instructions and geometry-semantic misalignment in the feature space. To address these issues, a synergistic semantic and geometric enhancement based affordance learning network (SSGE-Net) is proposed in this paper. First, a physics-aware semantic enhancement module is constructed to generate structured triplets of geometric constraints, functional descriptions and interaction logic; these triplets densify the semantics and compensate for the sparsity of the instruction information. Second, a multi-scale geometry refinement mechanism is designed, in which local dynamic graph convolution and global self-attention capture complementary topological details to enhance feature discriminability. Finally, a deep cross-modal alignment mechanism based on Transformer decoders is proposed: point cloud features are dynamically reconstructed by cross-attention under semantic guidance to achieve precise anchoring. Extensive experiments on the 3D AffordanceNet dataset demonstrate that SSGE-Net achieves consistent performance improvements under both full-view and partial-view settings, validating its superiority and robustness under complex viewpoints and in long-tail category scenarios.
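The cross-modal alignment step described above can be illustrated with a minimal sketch: a semantic (text) query attends over per-point features via scaled dot-product cross-attention, yielding attention weights that anchor the instruction to the relevant points. This is an illustrative single-head toy in plain Python, not the paper's implementation; the function names and toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(text_query, point_feats):
    """One cross-attention step: a text query attends over point features.

    text_query : list[float], length d       -- semantic query vector
    point_feats: list[list[float]], N rows x d -- per-point feature vectors
    Returns (weights over the N points, attended d-dim feature).
    """
    d = len(text_query)
    # Scaled dot-product scores between the query and each point feature.
    scores = [sum(q * k for q, k in zip(text_query, p)) / math.sqrt(d)
              for p in point_feats]
    weights = softmax(scores)
    # Attended feature: attention-weighted sum of point features.
    attended = [sum(w * p[j] for w, p in zip(weights, point_feats))
                for j in range(d)]
    return weights, attended
```

In a full model this single head would be replaced by a multi-head Transformer decoder layer, with the attended features feeding a per-point affordance head; the sketch only shows how semantic guidance reweights point features.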